Let us load the libraries required for this homework.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(gridExtra))
We shall be using the gapminder data set for this exercise.
We begin by removing the rows for the continent Oceania from the gapminder data set. Before we do this, let us use the str() function to check the number of factors in the gapminder data set and the number of levels each factor has.
gapminder %>% #loads the gapminder data set and pipes it into the function in the next line
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
This output shows that the gapminder data set has two factors: country and continent. Country has 142 levels, while continent has 5 levels. We also see that this data set has 1704 rows and 6 columns. We can visualize the number of observations per continent using a barchart, as shown below:
gapminder %>% # loads the gapminder data and pipes it into the next line
ggplot(aes(continent)) + geom_bar() + # use ggplot to produce a barchart
ggtitle("The number of observations for each continent")
Let us arrange this data by population in ascending order and plot a barchart of the new data set. We want to see if the arrange() function has an effect on the plot.
ArrGap <- gapminder %>% # loads the gapminder data and pipes it into the next line
arrange(pop) # arranges the data according to the population in ascending order
# display few rows of data
ArrGap %>% # loads the ArrGap data and pipes it into the next line
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Sao Tome and Principe | Africa | 1952 | 46.471 | 60011 | 879.5836 |
| Sao Tome and Principe | Africa | 1957 | 48.945 | 61325 | 860.7369 |
| Djibouti | Africa | 1952 | 34.812 | 63149 | 2669.5295 |
| Sao Tome and Principe | Africa | 1962 | 51.893 | 65345 | 1071.5511 |
| Sao Tome and Principe | Africa | 1967 | 54.425 | 70787 | 1384.8406 |
| Djibouti | Africa | 1957 | 37.328 | 71851 | 2864.9691 |
# plotting the data
ArrGap %>%
ggplot(aes(continent)) + geom_bar() + # use ggplot to produce a barchart
ggtitle("The number of observations for each continent")
Observe from the table and figures above that the arrange() function re-arranges the rows of the data, but this re-arrangement does not affect the figure. geom_bar() counts the rows for each factor level, and sorting the rows changes neither the levels nor the counts.
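The same point can be checked without a plot. The sketch below uses a small made-up data frame (the values are hypothetical, not taken from gapminder) and assumes dplyr is installed; it only illustrates that arrange() touches row order and nothing else.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data: hypothetical values, not taken from gapminder
toy <- data.frame(
  continent = factor(c("Asia", "Africa", "Asia", "Europe")),
  pop       = c(40, 10, 30, 20)
)

sorted <- arrange(toy, pop)  # sort rows by pop, ascending

sorted$pop                   # 10 20 30 40: the rows moved
levels(sorted$continent)     # "Africa" "Asia" "Europe": the levels did not
table(toy$continent)         # counts per level before sorting ...
table(sorted$continent)      # ... are the same after sorting
```

Since a barchart is drawn from exactly these per-level counts, sorting the rows cannot change it.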
Now, let us extract a subset of the gapminder data set.
# create a subset of gapminder data that does not contain Oceania, call this data set 'GapNoOcean'
GapNoOcean <- gapminder %>% #loads the data set and pipes it into the function in the next line
filter(continent != "Oceania")
# displaying table nicely
head(GapNoOcean,15) %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
| Afghanistan | Asia | 2002 | 42.129 | 25268405 | 726.7341 |
| Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.5803 |
| Albania | Europe | 1952 | 55.230 | 1282697 | 1601.0561 |
| Albania | Europe | 1957 | 59.280 | 1476505 | 1942.2842 |
| Albania | Europe | 1962 | 64.820 | 1728137 | 2312.8890 |
Let us use the str() function to check the factors and their levels of our new data set.
GapNoOcean %>% #loads the data set and pipes it into the function in the next line
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
After removing the rows for the countries in Oceania from the gapminder data set, the new data set has 1680 rows, although the continent factor still has 5 levels (and country still has 142). Let us check the levels of the continent factor again using the levels() function.
levels(GapNoOcean$continent) # displays the levels of the continent factor
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
This also shows 5 levels: levels() lists both the used and the unused levels.
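A minimal base R sketch of the same behaviour, using a made-up factor rather than the gapminder data: subsetting a factor removes values but leaves the declared levels alone, and table() makes the empty level visible.

```r
# Toy factor with three levels (hypothetical data)
continent <- factor(c("Africa", "Asia", "Asia", "Oceania"))

# Keep everything except Oceania
kept <- continent[continent != "Oceania"]

levels(kept)   # still lists "Oceania"
nlevels(kept)  # 3 declared levels
table(kept)    # Oceania appears with a count of 0
```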
GapNoOcean %>% #loads the data set and pipes it into the function in the next line
ggplot(aes(continent)) + geom_bar() + # plots a barchart of the data
scale_x_discrete(drop=FALSE) + ggtitle("The number of observations for each continent")
This plot shows that the continent Oceania is still a level in the data, although it has no entries. Let us arrange this new data set by population in ascending order and plot it again.
GapNoOcean %>% #loads the data set and pipes it into the function in the next line
arrange(pop) %>% # arranges the data according to the population size in ascending order
ggplot(aes(continent)) + geom_bar() + # plots barchart
ggtitle("The number of observations for each continent")
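After re-arranging the data, Oceania no longer appears in the barchart. This is not an effect of arrange(): this plot omits scale_x_discrete(drop=FALSE), and ggplot2 drops unused factor levels from a discrete axis by default.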
Let us remove unused levels in the continent.
GapNoOceanD <- GapNoOcean %>%
droplevels() # drops the unused levels
str(GapNoOceanD) # displays the structure of the data
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
levels(GapNoOceanD$continent) # shows the levels in the continent factor
## [1] "Africa" "Americas" "Asia" "Europe"
It now shows that the continent factor has only 4 levels.
There are other ways to drop unused levels. Let us check them out!
forcats::fct_drop(GapNoOcean$continent) %>% # drops unused levels
levels() # shows the levels in the continent factor
## [1] "Africa" "Americas" "Asia" "Europe"
fct_drop(GapNoOcean$continent) %>% # drops unused levels
levels() # shows the levels in the continent factor
## [1] "Africa" "Americas" "Asia" "Europe"
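Base R can do the same without forcats. The sketch below, on a made-up factor, compares droplevels() with simply calling factor() again on the subset; both rebuild the factor from the values that are actually present.

```r
# Toy factor (hypothetical data) with Oceania removed by subsetting
f    <- factor(c("Africa", "Asia", "Asia", "Oceania"))
kept <- f[f != "Oceania"]

via_droplevels <- droplevels(kept)  # drops levels with no observations
via_factor     <- factor(kept)      # re-creating the factor has the same effect

levels(via_droplevels)  # "Africa" "Asia"
levels(via_factor)      # "Africa" "Asia"
```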
Let us start by plotting a barchart of the GapNoOceanD data.
GapNoOceanD %>% # loads the data set and pipes it into the next line
ggplot(aes(continent)) + geom_bar() + # plots the barchart
scale_x_discrete(drop=FALSE) + ggtitle("The number of observations for each continent")
The continent factor can be reordered in descending order of frequency on the barchart:
GapNoOceanD$continent %>% # loads the data set and pipes it into the next line
fct_infreq() %>% # reorders level in descending order of frequency
qplot() + ggtitle("The number of observations for each continent in descending order") # plots barchart
This ordering is reversed using the forcats::fct_rev() function.
GapNoOceanD$continent %>% # loads the data set and pipes it into the next line
forcats::fct_infreq() %>% # reorders level in descending order of frequency
forcats::fct_rev() %>% # reverses the previous order
qplot() + ggtitle("The number of observations for each continent after reversing level")
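To see what these two functions do to the levels themselves, here is a small sketch on a made-up factor, assuming forcats is installed. The letters are placeholders, not gapminder values.

```r
library(forcats)

# "c" occurs three times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

by_freq <- fct_infreq(f)   # most frequent level first
levels(by_freq)            # "c" "b" "a"

levels(fct_rev(by_freq))   # "a" "b" "c": the same order, reversed
```

The values of the factor are untouched; only the order of the levels, and hence the order of the bars, changes.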
Let us present boxplots of pop with the continents reordered by some principled summary of pop. For reference, we first plot the continents in their default (alphabetical) level order.
GapNoOceanD %>%
ggplot( aes(continent,pop)) + geom_boxplot() +
scale_y_log10() + ggtitle("Boxplot of population for each continent in the default order")
# arrange() sorts rows only; it does not change the factor levels, so the boxes keep the default order
GapNoOceanD %>%
group_by(continent) %>%
mutate(median = median(pop)) %>%
arrange(median) %>%
ggplot( aes(continent,pop)) + geom_boxplot(fill='pink') +
scale_y_log10() + ggtitle("Boxplot of population for each continent after arrange() by median population")
GapNoOceanD %>%
mutate(continent= fct_reorder(continent,pop)) %>%
ggplot( aes(continent,pop)) + geom_boxplot(fill='pink') +
scale_y_log10() + ggtitle("Boxplot of population for each continent ordered by the median population")
GapNoOceanD %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,pop)) %>% # reorders the data
ggplot( aes(continent,pop)) + geom_boxplot() + # plots a boxplot
scale_y_log10() + ggtitle("Population of each continent reordered by median population")
GapNoOceanD %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,pop, mean)) %>% # reorders the data
ggplot( aes(continent, pop)) + geom_boxplot() + # plots a boxplot
scale_y_log10() + ggtitle("Population of each continent reordered by mean population")
GapNoOceanD %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,pop, max)) %>% # reorders the data
ggplot( aes(continent, pop)) + geom_boxplot() + # plots a boxplot
scale_y_log10() + ggtitle("Population of each continent reordered by maximum population")
GapNoOceanD %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,pop, min )) %>% # reorders the data
ggplot( aes(continent,pop)) + geom_boxplot() + # plots a boxplot
scale_y_log10() + ggtitle("Population of each continent reordered by minimum population")
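The effect of the summary function passed to fct_reorder() is easiest to see on a tiny example. The sketch below uses made-up groups and values (not gapminder data) chosen so that the median, maximum, minimum and mean do not all agree on the ordering; it assumes forcats is installed.

```r
library(forcats)

# Three hypothetical groups with three values each
grp <- factor(rep(c("a", "b", "c"), each = 3))
val <- c(1, 2, 100,    # group a: median 2,  max 100, min 1
         10, 11, 12,   # group b: median 11, max 12,  min 10
         5, 6, 7)      # group c: median 6,  max 7,   min 5

levels(fct_reorder(grp, val))               # default .fun = median: "a" "c" "b"
levels(fct_reorder(grp, val, .fun = max))   # "c" "b" "a"
levels(fct_reorder(grp, val, .fun = min))   # "a" "c" "b"
levels(fct_reorder(grp, val, .fun = mean))  # "c" "b" "a"
```

Group a has one large outlier, so it sorts first by median or minimum but last by maximum or mean, which is the kind of difference visible between the boxplots above.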
First, let us extract the data set to be used for this exercise from the gapminder data set.
NewGap <- gapminder %>% # loads data and pipes it to the next line
filter(continent == "Europe"|continent == "Asia"|continent == "Africa"|continent == "Oceania") %>%
arrange(gdpPercap)
str(NewGap) # displays structure of new data set
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 28 28 74 53 28 42 88 74 18 42 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 3 1 1 1 ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
# display few rows of new data set
NewGap %>% # loads data and pipes it to the next line
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
Let us create a boxplot of the extracted data and name it Default.
Default <-NewGap %>% # loads data and pipes it to the next line
ggplot( aes(continent,gdpPercap)) + geom_boxplot(fill='green') + # plots a boxplot
scale_y_log10() + ggtitle("Without factor re-order") # scales the y-axis and put title on the graph
Default
Now, we reorder the factor levels and create another boxplot.
NewGapReorder <-NewGap %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,gdpPercap,max)) # reorders the levels of continent by maximum gdpPercap
P <- NewGapReorder %>% # loads data and pipes it to the next line
ggplot( aes(continent,gdpPercap)) + geom_boxplot(fill='pink') + # produces a boxplot
scale_y_log10() + ggtitle("Re-ordering factor by max. GdpPercap")
# put the two boxplots side-by-side
grid.arrange(Default,P,ncol=2,top="Comparing the Boxplot of GdpPerCap for the continents with and without factor re-order")
Next, we display the ordering of the levels of continent.
levels(NewGapReorder$continent) # displays the levels of the selected factor in the order they are stored
## [1] "Africa" "Oceania" "Europe" "Asia" "Americas"
Let us create another plot of the reordered data and assign it to the variable GapOrdered. This plot will be placed side-by-side with each plot generated from data read back from file, in order to determine which writing and reading formats preserve the level ordering.
GapOrdered <-NewGapReorder %>% # loads data and pipes it to the next line
mutate(continent= fct_reorder(continent,gdpPercap,max)) %>% # reorders continent by max gdpPercap (already done above, so the order is unchanged)
ggplot( aes(continent,gdpPercap)) + geom_boxplot(fill='pink') + # produces boxplot
scale_y_log10() + ggtitle("Data written to file")
read_csv() and write_csv()
Writing the reordered data to file:
write_csv(NewGapReorder,"NewGap") # writes the reordered data to a file called 'NewGap', in .csv format
Reading back from file.
# reading data from file
GapCSV <- read_csv("NewGap") # reads the gapminder file in csv format
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
# displays the structure
GapCSV %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : chr "Congo, Dem. Rep." "Congo, Dem. Rep." "Lesotho" "Guinea-Bissau" ...
## $ continent: chr "Africa" "Africa" "Africa" "Africa" ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 6
## .. ..$ country : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ continent: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ lifeExp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ pop : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ gdpPercap: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# display in table
GapCSV %>%
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
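Observe that the structure is not the same as that of the data written to file: the factors continent and country are now character vectors, although the table contents are the same. Because the level ordering is lost, a plot of the data read from file falls back to alphabetical order.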
P2 <- GapCSV %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot() +
scale_y_log10() + ggtitle("Data read from file")
grid.arrange(GapOrdered,P2,ncol=2,top="Comparing the boxplots obtained from data written to file and the one from data read from file")
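A CSV file stores only text, so the level order has to be supplied again when reading. One possible way to do that, sketched here on a made-up data frame, is to pass a column specification with col_factor() to read_csv(). The column names grp and val and the level vector lvls are hypothetical; the sketch assumes readr is installed.

```r
library(readr)

lvls <- c("b", "c", "a")  # deliberately non-alphabetical level order
toy  <- data.frame(grp = factor(c("a", "b", "c", "b"), levels = lvls),
                   val = c(1, 2, 3, 4))

path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read as plain text: the factor and its level order are gone
plain <- read_csv(path, col_types = cols(grp = col_character(),
                                         val = col_double()))
class(plain$grp)          # "character"

# Supply the levels explicitly to rebuild the factor while reading
restored <- read_csv(path, col_types = cols(grp = col_factor(levels = lvls),
                                            val = col_double()))
levels(restored$grp)      # "b" "c" "a"
```

The drawback is that the level order must be kept somewhere outside the CSV file, which is what the binary and dput formats later in this exercise avoid.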
read_delim() and write_delim()
Writing to file:
write_delim(NewGapReorder,"NewGap.txt",delim = "$") # writes the data to file
Reading from file:
GapTxt <- read_delim("NewGap.txt", delim="$") # reads data from file
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
str(GapTxt) # display the structure of the data
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : chr "Congo, Dem. Rep." "Congo, Dem. Rep." "Lesotho" "Guinea-Bissau" ...
## $ continent: chr "Africa" "Africa" "Africa" "Africa" ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 6
## .. ..$ country : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ continent: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ lifeExp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ pop : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ gdpPercap: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# display few row in table
GapTxt %>%
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
As with the .csv format, the structure of the data read from file is not the same as that of the data written to file: the factors continent and country are now character vectors, although the table contents are the same. We can also verify this from the plots.
P3 <- GapTxt %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot() +
scale_y_log10() + ggtitle("Data read from file")
grid.arrange(GapOrdered,P3,ncol=2,top="Comparing the boxplots obtained from data written to file and the one from data read from file")
In addition, we can write data to a file with the columns separated by tabs. This is the TSV (tab-separated values) format.
read_delim() and write_delim() separated by tab
Writing data to file:
write_delim(NewGapReorder,"NewGap.tsv",delim = "\t") # writes the data to file
Reading data from file:
GapTab <- read_delim("NewGap.tsv", delim="\t") # reads the gapminder file in tsv format
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
str(GapTab) # display structure of data
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : chr "Congo, Dem. Rep." "Congo, Dem. Rep." "Lesotho" "Guinea-Bissau" ...
## $ continent: chr "Africa" "Africa" "Africa" "Africa" ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 6
## .. ..$ country : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ continent: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ lifeExp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ pop : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ gdpPercap: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# display few rows of data in table
GapTab %>%
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
Here also, the structure of the data read from file is not the same as that of the data written to file: the factors continent and country are now character vectors, although the table contents are the same. Let us display boxplots of the data side-by-side.
P4 <- GapTab %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot() +
scale_y_log10() + ggtitle("Data read from file")
grid.arrange(GapOrdered,P4,ncol=2,top="Comparing the boxplots obtained from data written to file and the one from data read from file")
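For tab-separated files, readr also provides the shortcuts write_tsv() and read_tsv(), which fix the delimiter to a tab. The sketch below round-trips a small made-up data frame (two rows with values copied from the tables above) and assumes readr is installed.

```r
library(readr)

toy <- data.frame(country = c("Congo, Dem. Rep.", "Lesotho"),
                  pop     = c(55379852, 748747))

path <- tempfile(fileext = ".tsv")
write_tsv(toy, path)   # same idea as write_delim(..., delim = "\t")

back <- read_tsv(path, col_types = cols(country = col_character(),
                                        pop     = col_double()))
back$country           # the comma inside the name is not treated as a separator
```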
readRDS() and saveRDS()
Writing to file:
saveRDS(NewGapReorder,"NewGap.rds") # writes the data set in compressed format (.rds format)
This function writes the data to file in a compressed format.
Reading from file:
GapRDS <- readRDS("NewGap.rds") # reads from file
str(GapRDS) # display the structure
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 28 28 74 53 28 42 88 74 18 42 ...
## $ continent: Factor w/ 5 levels "Africa","Oceania",..: 1 1 1 1 1 1 4 1 1 1 ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
# display few rows of data in table
GapRDS %>%
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
Unlike the previous formats we have seen, this format preserves the structure of the data when reading from file. Let us confirm that the levels are still in the same order as when they were written to the file.
levels(GapRDS$continent) # displays the levels of the selected factor in proper order
## [1] "Africa" "Oceania" "Europe" "Asia" "Americas"
Observe that the ordering is also preserved. Yay! Let us check the boxplots:
P5 <- GapRDS %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot() +
scale_y_log10() + ggtitle("Data read from file")
grid.arrange(GapOrdered,P5,ncol=2,top="Comparing the boxplots obtained from data written to file and the one from data read from file")
The ordering is also preserved in the figures.
dput() and dget() functions
Writing to file:
dput(NewGapReorder,"MyGapMinderDput.txt") # writes data to file
Reading from file:
GapDput <- dget("MyGapMinderDput.txt") # reads data from file
str(GapDput) # displays structure
## Classes 'tbl_df', 'tbl' and 'data.frame': 1404 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 28 28 74 53 28 42 88 74 18 42 ...
## $ continent: Factor w/ 5 levels "Africa","Oceania",..: 1 1 1 1 1 1 4 1 1 1 ...
## $ year : int 2002 2007 1952 1952 1997 1952 1952 1957 1952 1957 ...
## $ lifeExp : num 45 46.5 42.1 32.5 42.6 ...
## $ pop : int 55379852 64606759 748747 580653 47798986 1438760 20092996 813338 2445618 1542611 ...
## $ gdpPercap: num 241 278 299 300 312 ...
# display few rows in table
GapDput %>%
head() %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Congo, Dem. Rep. | Africa | 2002 | 44.966 | 55379852 | 241.1659 |
| Congo, Dem. Rep. | Africa | 2007 | 46.462 | 64606759 | 277.5519 |
| Lesotho | Africa | 1952 | 42.138 | 748747 | 298.8462 |
| Guinea-Bissau | Africa | 1952 | 32.500 | 580653 | 299.8503 |
| Congo, Dem. Rep. | Africa | 1997 | 42.587 | 47798986 | 312.1884 |
| Eritrea | Africa | 1952 | 35.928 | 1438760 | 328.9406 |
The structure of the data is also preserved with this format. How about we plot a boxplot of the data read from file and compare with the boxplot of the data written to file?
P6 <- GapDput %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot() +
scale_y_log10() + ggtitle("Data read from file")
grid.arrange(GapOrdered,P6,ncol=2,top="Comparing the boxplots obtained from data written to file and the one from data read from file")
These figures are identical, which confirms that the structure of the data is preserved.
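The comparison across formats can be condensed into one small round-trip check. The sketch below uses base R only (write.csv()/read.csv() stand in for the readr functions used above) and a made-up data frame whose factor has a deliberately non-alphabetical level order.

```r
lvls <- c("b", "c", "a")  # custom level order to track
toy  <- data.frame(grp = factor(c("a", "b", "c"), levels = lvls),
                   val = 1:3)

csv_path  <- tempfile(fileext = ".csv")
rds_path  <- tempfile(fileext = ".rds")
dput_path <- tempfile(fileext = ".txt")

write.csv(toy, csv_path, row.names = FALSE)  # plain text table
saveRDS(toy, rds_path)                       # serialized R object
dput(toy, dput_path)                         # R code that rebuilds the object

from_csv  <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds  <- readRDS(rds_path)
from_dput <- dget(dput_path)

class(from_csv$grp)    # "character": factor and level order are lost
levels(from_rds$grp)   # "b" "c" "a": preserved
levels(from_dput$grp)  # "b" "c" "a": preserved
```

The delimited formats keep only the cell values, while saveRDS() and dput() store the object together with its attributes, which is why the level order survives.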
We begin by loading the necessary libraries required for this exercise.
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(plotly))
FirstAttempt <- gapminder %>%
ggplot(aes(pop,lifeExp)) + geom_point(aes(color=gdpPercap), alpha = 0.3)
Attempt1 <- FirstAttempt + scale_x_log10() + ggtitle("Attempt 1")
Attempt1
Attempt2 <- FirstAttempt + # starts from the base plot
scale_color_continuous(trans="log10", # maps gdpPercap to color on a log10 scale
breaks = 5^(1:10), # places the legend breaks at powers of 5
labels = comma_format() # writes the legend labels with comma separators
) +
scale_y_continuous(breaks = 10*(1:10)) + # labels the y-axis in multiples of 10
scale_x_continuous(trans = "log10", # puts the x-axis on a log10 scale
labels = comma_format() # writes the x-axis labels with comma separators
) + ggtitle("Attempt 2: scale_color_continuous")
Attempt2
Attempt3 <- FirstAttempt + # starts from the base plot
scale_color_distiller(trans="log10", # maps gdpPercap to color on a log10 scale
breaks = 5^(1:10), # places the legend breaks at powers of 5
labels = comma_format(), # writes the legend labels with comma separators
palette = "Reds") +
scale_y_continuous(breaks = 10*(1:10)) + # labels the y-axis in multiples of 10
scale_x_continuous(trans = "log10", # puts the x-axis on a log10 scale
labels = comma_format() # writes the x-axis labels with comma separators
) + ggtitle("Attempt 3: scale_color_distiller")
Attempt3
Attempt4 <- FirstAttempt + # starts from the base plot
scale_color_viridis_c(trans="log10", # maps gdpPercap to color on a log10 scale
breaks = 5^(1:10), # places the legend breaks at powers of 5
labels = comma_format() # writes the legend labels with comma separators
) +
scale_y_continuous(breaks = 10*(1:10)) + # labels the y-axis in multiples of 10
scale_x_continuous(trans = "log10", # puts the x-axis on a log10 scale
labels = comma_format() # writes the x-axis labels with comma separators
) + ggtitle("Attempt 4: scale_color_viridis_c")
Attempt4
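The breaks and labels used in these scales can be inspected on their own. This short sketch assumes the scales package is installed and uses comma(), which formats a numeric vector directly, to preview the kind of labels that comma_format() produces on a scale.

```r
library(scales)

brks <- 5^(1:5)   # first five of the breaks used above
brks              # 5 25 125 625 3125: powers of 5, evenly spaced on a log scale

comma(brks)       # character labels with "," as the thousands separator
comma(3125)       # "3,125"
```

Powers of a fixed base are a natural choice for a log-transformed scale because they land at equal distances along the legend.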
Putting all the attempts together, we have:
grid.arrange(Attempt1,Attempt2,Attempt3,Attempt4,ncol=2,top="Different Attempts")
Now let us convert two of our attempts to plotly objects using ggplotly():
plotly1 <- ggplotly(Attempt1)
plotly1
plotly4 <- ggplotly(Attempt4)
plotly4
We observe from these figures that ggplotly() makes the figures interactive: we can zoom in and out, select part of the figure, save it as a .png file, and more. It even shows the values of the variables when the mouse hovers over a point in the graph.
Let us do some activities with the theme() function. First, we plot a figure from the gapminder data, after which we make it more presentable.
lifeExp_vs_Gdp <- gapminder %>%
ggplot(aes(gdpPercap, lifeExp)) + geom_point(aes(color=continent)) +
scale_x_log10() +
facet_wrap(~continent) +
ggtitle( "Life Expectancy vs GDP per capita for different continents")
lifeExp_vs_Gdp
lifeExp_vs_Gdp +
theme_bw() +
theme(axis.text = element_text(size=16),
strip.text = element_text(size=16,color = "blue"),
strip.background = element_rect(fill = "orange"),
panel.background = element_rect(fill = "pink"))
Let us plot directly using the ‘plot_ly()’ function
gapminder %>%
plot_ly(x = ~gdpPercap,
y = ~pop,
type = "scatter",
colors = 'red',
mode = "markers",
marker=list(color="green" , size=5 , opacity=1) , # marker opacity must lie between 0 and 1
opacity = 0.5) %>%
layout(xaxis = list(type = "log"),yaxis = list(type = "log"))
gapminder %>%
plot_ly(x = ~lifeExp,
y = ~gdpPercap,
type = "scatter",
colors = 'red',
mode = "markers",
marker=list(color="red" , size=10 , opacity=0.5) ,
opacity = 0.5) %>%
layout(xaxis = list(type = "log"),yaxis = list(type = "log"))
NewGap_Boxplot <-NewGap %>%
mutate(continent= fct_reorder(continent,gdpPercap,max)) %>%
arrange(gdpPercap) %>%
ggplot( aes(continent,gdpPercap)) + geom_boxplot(fill='pink') +
scale_y_log10() + ggtitle("Boxplot of gdpPercap for each continent ordered by the maximum gdpPercap")
ggsave("Boxplot.png", NewGap_Boxplot)
## Saving 7 x 5 in image
ggsave("Boxplot2.png", plot = NewGap_Boxplot, width = 2, height = 2)
ggsave("Boxplot3.png", plot = NewGap_Boxplot, width = 5, height = 2, limitsize =TRUE, dpi = 300)
ggsave("Boxplot4.pdf", plot = NewGap_Boxplot, width = 5, height = 2, limitsize =TRUE, dpi = 300)
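A self-contained sketch of the same ggsave() pattern, using the built-in mtcars data and a temporary file so nothing is written to the working directory. The plot itself is a placeholder; the point is that width and height are given in inches by default and that the output format is chosen from the file extension. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Placeholder plot built from a data set that ships with R
p <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

out <- tempfile(fileext = ".pdf")   # the extension selects the PDF device
ggsave(out, plot = p, width = 5, height = 2, units = "in")

file.exists(out)                    # TRUE once the file has been written
```

Passing plot explicitly, as done above with NewGap_Boxplot, avoids depending on whichever plot happened to be displayed last.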
Let us extract the rows for some countries in the gapminder data:
Sport_Gap <- gapminder %>%
filter(year == 1952) %>%
filter(country %in% c("Nigeria", "Canada", "India", "Sweden", "Brazil")) %>%
mutate(Sports = factor(c("Soccer","Ice hockey","Cricket","Biking","Baseball"))) %>%
droplevels()
Sport_Gap %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap | Sports |
|---|---|---|---|---|---|---|
| Brazil | Americas | 1952 | 50.917 | 56602560 | 2108.9444 | Soccer |
| Canada | Americas | 1952 | 68.750 | 14785584 | 11367.1611 | Ice hockey |
| India | Asia | 1952 | 37.373 | 372000000 | 546.5657 | Cricket |
| Nigeria | Africa | 1952 | 36.324 | 33119096 | 1077.2819 | Biking |
| Sweden | Europe | 1952 | 71.860 | 7124673 | 8527.8447 | Baseball |
Sport_Gap %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 7 variables:
## $ country : Factor w/ 5 levels "Brazil","Canada",..: 1 2 3 4 5
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 2 2 3 1 4
## $ year : int 1952 1952 1952 1952 1952
## $ lifeExp : num 50.9 68.8 37.4 36.3 71.9
## $ pop : int 56602560 14785584 372000000 33119096 7124673
## $ gdpPercap: num 2109 11367 547 1077 8528
## $ Sports : Factor w/ 5 levels "Baseball","Biking",..: 5 4 3 2 1
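The integer codes 5 4 3 2 1 in the Sports column follow from how factor() chooses levels: by default it sorts the distinct values, so the levels are alphabetical even though the values were typed in another order. A minimal base R sketch, which also shows how to keep the order as given by passing levels explicitly:

```r
sports <- c("Soccer", "Ice hockey", "Cricket", "Biking", "Baseball")

default <- factor(sports)
levels(default)       # "Baseball" "Biking" "Cricket" "Ice hockey" "Soccer"
as.integer(default)   # 5 4 3 2 1, matching the str() output above

as_given <- factor(sports, levels = sports)
levels(as_given)      # same order as the input vector
as.integer(as_given)  # 1 2 3 4 5
```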